checkpoint restart mechanisms

2010-11-04

New demo video online: Checkpointing and restart

We are pleased to announce that we have another new demo video online: the checkpointing demo by John Mehnert-Spahn from the University of Duesseldorf.

http://vimeo.com/16494297

All our technical video demos are available here

2009-02-12

XtreemOS-related paper accepted at CCGrid 2009

Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems

 

Authors: Pierre Riteau, Adrien Lebre and Christine Morin

CCGrid 2009 

Abstract

Computer clusters are today the reference architecture for high-performance computing.
The large number of nodes in these systems induces a high failure rate. This makes fault tolerance mechanisms, e.g. process checkpoint/restart, a required technology to effectively exploit clusters.
Most of the process checkpoint/restart implementations only handle volatile states and do not take into account persistent states of applications, which can lead to incoherent application restarts.
In this paper, we introduce an efficient persistent state checkpoint/restoration approach that can be interconnected with a large number of file systems. To avoid the performance issues of a stable support relying on synchronous replication mechanisms, we present a failure resilience scheme optimized for such persistent state checkpointing techniques in a distributed environment. First evaluations of our implementation in the kDFS distributed file system show the negligible performance impact of our proposal.